Practical Synthetic Data Generation by Khaled El Emam, Lucy Mosquera, and Richard Hoptroff

Author: Khaled El Emam, Lucy Mosquera, Richard Hoptroff
Language: eng
Format: mobi, pdf, pdf
Publisher: O'Reilly Media, Inc.
Published: 2020-05-19T00:00:00+00:00


Chapter 5. Methods for Synthesizing Data

Having described some basic methods for distribution fitting in the previous chapter, we will now use those concepts to generate synthetic data. We will start with basic approaches and build up to more complex ones as the chapter progresses. We will also point to more advanced techniques that are beyond the scope of an introductory text, but what we cover here should give you a good introduction.

Generating Synthetic Data from Theory

Let’s consider the situation where the analyst does not have any real data to start off with, but has some understanding of the phenomenon that they want to model and generate data for. For example, let’s say that we want to generate data reflecting the relationship between height and weight. It is generally known that height and weight are positively associated.

According to the Centers for Disease Control, the average height for men in the US is approximately 175 cm [1], and for our example we will assume a standard deviation of 5 cm. The average weight is 89.7 kg, and we will assume a standard deviation of 10 kg. We will model both variables as normal (Gaussian, or bell-shaped) distributions and assume that the correlation between them is 0.5. According to Cohen’s guidelines for the interpretation of effect sizes, a correlation of magnitude 0.5 is considered large, 0.3 medium, and 0.1 small. Any correlation above 0.5 would be a strong correlation in practice [2]. Therefore, at 0.5 we are assuming a large correlation between height and weight. Based on these specifications, we can create a dataset of 5,000 observations that models this phenomenon.
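To make these specifications concrete, here is a minimal sketch that uses only the Python standard library. It is one possible construction, not necessarily the one the chapter goes on to use: it draws two independent standard normal values per observation and mixes them so that the resulting pair has correlation 0.5, which implies a covariance of 0.5 × 5 × 10 = 25. The function names, the variable names, and the seed are assumptions made for illustration.

```python
import math
import random

# Parameters taken from the text.
MEAN_HEIGHT, SD_HEIGHT = 175.0, 5.0   # cm
MEAN_WEIGHT, SD_WEIGHT = 89.7, 10.0   # kg
RHO = 0.5                             # assumed correlation
N = 5000                              # number of observations


def sample_height_weight(n, rho, seed=42):
    """Return n (height, weight) pairs from a bivariate normal distribution.

    The seed is an arbitrary choice so that the sketch is reproducible.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Mix the two independent draws so that corr(z1, z_w) equals rho.
        z_w = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        rows.append((MEAN_HEIGHT + SD_HEIGHT * z1,
                     MEAN_WEIGHT + SD_WEIGHT * z_w))
    return rows


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


data = sample_height_weight(N, RHO)
heights = [h for h, _ in data]
weights = [w for _, w in data]
print("sample correlation:", round(pearson(heights, weights), 3))
```

With 5,000 observations, the sample means and the sample correlation should land close to the specified values, although they will not match them exactly.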

We will present three ways to do this: (a) sampling from multivariate (normal) distributions, (b) inducing a correlation during the sampling process, and (c) using copulas. Each will be illustrated below.
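As a rough illustration of option (c), the sketch below uses a Gaussian copula built with the Python standard library. This is an assumption about what a copula-based generator could look like, since the chapter's own illustration is not part of this excerpt. Correlated standard normal values are mapped to uniform values through the normal CDF, and each uniform value is then pushed through the inverse CDF of its marginal distribution. The function name, the clamping constant, and the seed are hypothetical choices.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_sample(n, rho, marginals, seed=7):
    """Draw n pairs whose dependence comes from a Gaussian copula.

    `marginals` is a pair of objects that expose inv_cdf(p); here they are
    statistics.NormalDist instances, matching the normal marginals in the text.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    eps = 1e-12                # keeps probabilities strictly inside (0, 1)
    out = []
    for _ in range(n):
        # Step 1: correlated standard normal pair.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Step 2: map each value to a uniform probability.
        u = [min(max(std_normal.cdf(z), eps), 1.0 - eps) for z in (z1, z2)]
        # Step 3: apply each marginal's inverse CDF.
        out.append(tuple(m.inv_cdf(p) for m, p in zip(marginals, u)))
    return out


height_dist = NormalDist(mu=175.0, sigma=5.0)   # cm
weight_dist = NormalDist(mu=89.7, sigma=10.0)   # kg
copula_data = gaussian_copula_sample(5000, 0.5, (height_dist, weight_dist))
print(copula_data[:3])
```

Because both marginals are normal here, the result is again a bivariate normal sample. The appeal of the copula construction is that either marginal could be swapped for a different distribution while the dependence structure in step 1 stays the same.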


